Vid2speech: Speech Reconstruction from Silent Video
Speechreading is a notoriously difficult task for humans to perform. In this
paper we present an end-to-end model based on a convolutional neural network
(CNN) for generating an intelligible acoustic speech signal from silent video
frames of a speaking person. The proposed CNN generates sound features for each
frame based on its neighboring frames. Waveforms are then synthesized from the
learned speech features to produce intelligible speech. We show that by
leveraging the automatic feature learning capabilities of a CNN, we can obtain
state-of-the-art word intelligibility on the GRID dataset, and show promising
results for learning out-of-vocabulary (OOV) words.

Comment: Accepted for publication at ICASSP 2017
Seeing Through Noise: Visually Driven Speaker Separation and Enhancement
Isolating the voice of a specific person while filtering out other voices or
background noises is challenging when video is shot in noisy environments. We
propose audio-visual methods to isolate the voice of a single speaker and
eliminate unrelated sounds. First, face motions captured in the video are used
to estimate the speaker's voice by passing the silent video frames through a
neural network-based video-to-speech model. Then the speech predictions are
applied as a filter on the noisy input audio. This approach avoids using
mixtures of sounds in the learning process, as the number of such possible
mixtures is huge, and would inevitably bias the trained model. We evaluate our
method on two audio-visual datasets, GRID and TCD-TIMIT, and show that our
method attains significant SDR and PESQ improvements over both the raw
video-to-speech predictions and a well-known audio-only method.

Comment: Supplementary video: https://www.youtube.com/watch?v=qmsyj7vAzo